This notebook implements a comprehensive fraud detection pipeline using tree-based models, SMOTE for class balancing, and robust evaluation practices.
It draws inspiration from the principles outlined in the Fraud Detection Handbook, which emphasizes proper validation, class-imbalance handling, and metrics such as AUPRC that remain meaningful in highly skewed settings like fraud detection.
Data Preprocessing
Credit card fraud detection requires careful preprocessing due to the highly skewed nature of the data and the presence of anonymized features.
We standardize the Amount feature using StandardScaler so that its scale is comparable to the anonymized (PCA-derived) features, and drop the Time column, which does not contribute predictive power in this context.
This prepares our features for both gradient-boosting models and interpretable models like Logistic Regression.
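A minimal sketch of this preprocessing step, assuming the standard Kaggle credit-card dataset (columns Time, V1–V28, Amount, Class) is loaded as dataset; the file path is an assumption, and X_full mirrors the full-feature matrix used for full-dataset predictions in the model cells below:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the raw data (path is an assumption), scale Amount, and drop Time.
dataset = pd.read_csv("Data/creditcard.csv")
dataset["Amount"] = StandardScaler().fit_transform(dataset[["Amount"]])
dataset = dataset.drop(columns=["Time"])

# Features and labels for the full dataset.
X_full = dataset.drop(columns=["Class"])
y_full = dataset["Class"]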
Handling Class Imbalance with SMOTE
In highly imbalanced settings (fraud rate < 0.2%), models trained on raw data will often predict only the majority class. To address this:
We use SMOTE (Synthetic Minority Oversampling Technique) which creates new synthetic minority examples based on feature space similarity.
This allows models to generalize better without overfitting.
Unlike random oversampling, SMOTE preserves diversity in minority examples.
The Fraud Detection Handbook discusses SMOTE and related resampling techniques as part of offline evaluation, before temporal or live validation is considered. Crucially, it recommends applying resampling to the training split only, so that the evaluation set keeps the original class distribution (see the sketch below).
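A minimal sketch of that recommended workflow (split first, then oversample only the training portion); the variable names and split parameters are assumptions and may differ from the notebook's actual cells:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first, then oversample only the training portion so the test set
# keeps the original, imbalanced class distribution.
X_tr, X_test, y_tr, y_test = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=42
)
X_train, y_train = SMOTE(random_state=42).fit_resample(X_tr, y_tr)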
Model Training
We train a series of classifiers on the balanced dataset to compare their performance in a reproducible way.
Key models explored:
- XGBoost: handles tabular, non-linear patterns efficiently.
- Logistic Regression: a robust and interpretable baseline.
- LightGBM & CatBoost: fast gradient-boosting models optimized for large datasets with categorical features.
Each model is trained and evaluated using a consistent pipeline for fairness.
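As a sketch of what such a shared pipeline can look like (not code from the notebook), a single helper can fit any classifier and return the same one-row metric table produced in the sections below:

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit a classifier and return a one-row table of the shared metrics."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y_test, preds)],
        "Recall": [recall_score(y_test, preds)],
        "Precision": [precision_score(y_test, preds)],
        "F1 Score": [f1_score(y_test, preds)],
    })

# Example: evaluate_model(XGBClassifier(random_state=42), X_train, y_train, X_test, y_test)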
Model Evaluation
Because accuracy is misleading in imbalanced datasets, we use the following metrics:
Precision: Accuracy of positive (fraud) predictions.
Recall: Coverage of actual fraud cases.
F1 Score: Balance between precision and recall.
ROC-AUC: General discrimination ability of the model.
AUPRC (planned): More informative for rare event detection.
This is consistent with the Fraud Detection Handbook’s emphasis on threshold-free and cost-aware metrics for highly imbalanced problems.
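As a sketch of the planned AUPRC metric, average precision summarizes the area under the precision-recall curve; it assumes a model's predicted probabilities on the test set are available (for example xgb_preds_prob from the XGBoost cell below):

from sklearn.metrics import average_precision_score

# Average precision approximates the area under the precision-recall curve.
auprc = average_precision_score(y_test, xgb_preds_prob)
print(f"AUPRC: {auprc:.5f}")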
Visualizations
Visual diagnostics help us understand model performance:
Confusion Matrix: Visualizes false positives and false negatives.
ROC Curve: Shows the tradeoff between sensitivity and specificity.
(Planned): AUPRC and threshold-based precision-recall tradeoffs for deeper insight into model confidence.
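A minimal sketch of the planned precision-recall diagnostic, again assuming test labels and a model's probability scores (e.g. xgb_preds_prob) are available:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Precision and recall across all decision thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, xgb_preds_prob)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()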
Export Results
All key outputs are stored as CSVs in Data/output/ to support:
Auditable results tracking
Integration with dashboards (e.g. Power BI, Streamlit)
Model comparison reports
Reproducibility for experiment logging
The structure supports iterative ML development as recommended in operational fraud detection systems.
2. Train XGBoost Model
We initialize and train an XGBoost classifier using the balanced dataset.
Code
# Train XGBoost
xgb = XGBClassifier(eval_metric='logloss', random_state=42, enable_categorical=False)
xgb.fit(X_train, y_train)

# Evaluation
train_score = xgb.score(X_train, y_train)
test_score = xgb.score(X_test, y_test)
print(f"Training Score: {train_score:.5f}")
print(f"Test Score: {test_score:.5f}")

# Predictions
xgb_preds = xgb.predict(X_test)
xgb_preds_prob = xgb.predict_proba(X_test)[:, 1]
xgb_preds_full = xgb.predict(X_full)
xgb_preds_full_prob = xgb.predict_proba(X_full)[:, 1]

# Add predictions to original dataset
dataset3 = dataset.copy()
dataset3['predictions'] = xgb_preds_full
dataset3['predictions_prob'] = xgb_preds_full_prob

# Confusion Matrix and Metrics
cm = confusion_matrix(y_test, xgb_preds)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix')
plt.show()

metrics = pd.DataFrame({
    'Accuracy': [round(accuracy_score(y_test, xgb_preds), 4)],
    'Recall': [round(recall_score(y_test, xgb_preds), 4)],
    'Precision': [round(precision_score(y_test, xgb_preds), 4)],
    'F1 Score': [round(f1_score(y_test, xgb_preds), 4)]
})
print(metrics)

# ROC and AUC
fpr, tpr, thresh = roc_curve(y_test, xgb_preds_prob)
roc_auc = auc(fpr, tpr)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot()
plt.title('ROC Curve')
plt.show()

# Export
metrics.to_csv("Data/output/xgboost_metrics.csv", index=False)
pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresh': thresh}).to_csv("Data/output/xgboost_roc_curve_data.csv", index=False)
pd.DataFrame({'AUC': [roc_auc]}).to_csv("Data/output/xgboost_auc.csv", index=False)
dataset3.to_csv("Data/output/xgboost_predictions.csv", index=False)
Training Score: 0.99997
Test Score: 0.99967
Accuracy Recall Precision F1 Score
0 0.9997 1.0 0.9994 0.9997
3. Logistic Regression Model
In this section, we train a Logistic Regression model using the same balanced dataset and evaluate it using the same pipeline.
Train Logistic Regression Model
We train a logistic regression model for comparison using the same dataset.
Code
# Train Logistic Regression
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

# Evaluation
train_score = logreg.score(X_train, y_train)
test_score = logreg.score(X_test, y_test)
print(f"Training Score: {train_score:.5f}")
print(f"Test Score: {test_score:.5f}")

# Predictions
logreg_preds = logreg.predict(X_test)
logreg_preds_prob = logreg.predict_proba(X_test)[:, 1]
logreg_preds_full = logreg.predict(X_full)
logreg_preds_full_prob = logreg.predict_proba(X_full)[:, 1]

# Add predictions to original dataset (column names, not file paths)
dataset3 = dataset.copy()
dataset3['logreg_predictions'] = logreg_preds_full
dataset3['logreg_predictions_prob'] = logreg_preds_full_prob

# Confusion Matrix and Metrics
cm = confusion_matrix(y_test, logreg_preds)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix')
plt.show()

metrics = pd.DataFrame({
    'Accuracy': [round(accuracy_score(y_test, logreg_preds), 4)],
    'Recall': [round(recall_score(y_test, logreg_preds), 4)],
    'Precision': [round(precision_score(y_test, logreg_preds), 4)],
    'F1 Score': [round(f1_score(y_test, logreg_preds), 4)]
})
print(metrics)

# ROC and AUC
fpr, tpr, thresh = roc_curve(y_test, logreg_preds_prob)
roc_auc = auc(fpr, tpr)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot()
plt.title('ROC Curve')
plt.show()

# Export
metrics.to_csv("Data/output/logreg_metrics.csv", index=False)
pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresh': thresh}).to_csv("Data/output/logreg_roc_curve_data.csv", index=False)
pd.DataFrame({'AUC': [roc_auc]}).to_csv("Data/output/logreg_auc.csv", index=False)
dataset3.to_csv("Data/output/logreg_predictions.csv", index=False)
Training Score: 0.94602
Test Score: 0.94641
Accuracy Recall Precision F1 Score
0 0.9464 0.9176 0.974 0.9449
4. Random Forest Classifier
Here we implement and evaluate a Random Forest classifier.
Train and Evaluate Random Forest Model
We train a Random Forest classifier, make predictions on the test set, and evaluate them using accuracy, precision, recall, and F1-score.
Code
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluation
train_score = rf.score(X_train, y_train)
test_score = rf.score(X_test, y_test)
print(f"Training Score: {train_score:.5f}")
print(f"Test Score: {test_score:.5f}")

# Predictions
rf_preds = rf.predict(X_test)
rf_preds_prob = rf.predict_proba(X_test)[:, 1]
rf_preds_full = rf.predict(X_full)
rf_preds_full_prob = rf.predict_proba(X_full)[:, 1]

# Add predictions to original dataset (column names, not file paths)
dataset3 = dataset.copy()
dataset3['rf_predictions'] = rf_preds_full
dataset3['rf_predictions_prob'] = rf_preds_full_prob

# Confusion Matrix and Metrics
cm = confusion_matrix(y_test, rf_preds)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix')
plt.show()

metrics = pd.DataFrame({
    'Accuracy': [round(accuracy_score(y_test, rf_preds), 4)],
    'Recall': [round(recall_score(y_test, rf_preds), 4)],
    'Precision': [round(precision_score(y_test, rf_preds), 4)],
    'F1 Score': [round(f1_score(y_test, rf_preds), 4)]
})
print(metrics)

# ROC and AUC
fpr, tpr, thresh = roc_curve(y_test, rf_preds_prob)
roc_auc = auc(fpr, tpr)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot()
plt.title('ROC Curve')
plt.show()

# Export
metrics.to_csv("Data/output/rf_metrics.csv", index=False)
pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresh': thresh}).to_csv("Data/output/rf_roc_curve_data.csv", index=False)
pd.DataFrame({'AUC': [roc_auc]}).to_csv("Data/output/rf_auc.csv", index=False)
dataset3.to_csv("Data/output/rf_predictions.csv", index=False)
Training Score: 1.00000
Test Score: 0.99991
Accuracy Recall Precision F1 Score
0 0.9999 1.0 0.9998 0.9999
5. LightGBM Classifier
We now train a LightGBM classifier and evaluate its performance similarly.
Train LightGBM Model
We train a LightGBM model with default parameters.
Code
# Train LightGBM
lgbm = LGBMClassifier(random_state=42)
lgbm.fit(X_train, y_train)

# Evaluation
train_score = lgbm.score(X_train, y_train)
test_score = lgbm.score(X_test, y_test)
print(f"Training Score: {train_score:.5f}")
print(f"Test Score: {test_score:.5f}")

# Predictions
lgbm_preds = lgbm.predict(X_test)
lgbm_preds_prob = lgbm.predict_proba(X_test)[:, 1]
lgbm_preds_full = lgbm.predict(X_full)
lgbm_preds_full_prob = lgbm.predict_proba(X_full)[:, 1]

# Add predictions to original dataset (column names, not file paths)
dataset3 = dataset.copy()
dataset3['lgbm_predictions'] = lgbm_preds_full
dataset3['lgbm_predictions_prob'] = lgbm_preds_full_prob

# Confusion Matrix and Metrics
cm = confusion_matrix(y_test, lgbm_preds)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix')
plt.show()

metrics = pd.DataFrame({
    'Accuracy': [round(accuracy_score(y_test, lgbm_preds), 4)],
    'Recall': [round(recall_score(y_test, lgbm_preds), 4)],
    'Precision': [round(precision_score(y_test, lgbm_preds), 4)],
    'F1 Score': [round(f1_score(y_test, lgbm_preds), 4)]
})
print(metrics)

# ROC and AUC
fpr, tpr, thresh = roc_curve(y_test, lgbm_preds_prob)
roc_auc = auc(fpr, tpr)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot()
plt.title('ROC Curve')
plt.show()

# Export
metrics.to_csv("Data/output/lgbm_metrics.csv", index=False)
pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresh': thresh}).to_csv("Data/output/lgbm_roc_curve_data.csv", index=False)
pd.DataFrame({'AUC': [roc_auc]}).to_csv("Data/output/lgbm_auc.csv", index=False)
dataset3.to_csv("Data/output/lgbm_predictions.csv", index=False)
[LightGBM] [Info] Number of positive: 227313, number of negative: 227591
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.033456 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7395
[LightGBM] [Info] Number of data points in the train set: 454904, number of used features: 29
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499694 -> initscore=-0.001222
[LightGBM] [Info] Start training from score -0.001222
Training Score: 0.99935
Test Score: 0.99916
Accuracy Recall Precision F1 Score
0 0.9992 1.0 0.9984 0.9992
6. CatBoost Classifier
Finally, we train and evaluate a CatBoost classifier using the same pipeline.
Train CatBoost Model
This cell trains a CatBoost classifier, which is particularly useful when a dataset contains categorical features.
Code
# Train CatBoost
cat = CatBoostClassifier(verbose=0, random_state=42)
cat.fit(X_train, y_train)

# Evaluation
train_score = cat.score(X_train, y_train)
test_score = cat.score(X_test, y_test)
print(f"Training Score: {train_score:.5f}")
print(f"Test Score: {test_score:.5f}")

# Predictions
cat_preds = cat.predict(X_test)
cat_preds_prob = cat.predict_proba(X_test)[:, 1]
cat_preds_full = cat.predict(X_full)
cat_preds_full_prob = cat.predict_proba(X_full)[:, 1]

# Add predictions to original dataset (column names, not file paths)
dataset3 = dataset.copy()
dataset3['cat_predictions'] = cat_preds_full
dataset3['cat_predictions_prob'] = cat_preds_full_prob

# Confusion Matrix and Metrics
cm = confusion_matrix(y_test, cat_preds)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix')
plt.show()

metrics = pd.DataFrame({
    'Accuracy': [round(accuracy_score(y_test, cat_preds), 4)],
    'Recall': [round(recall_score(y_test, cat_preds), 4)],
    'Precision': [round(precision_score(y_test, cat_preds), 4)],
    'F1 Score': [round(f1_score(y_test, cat_preds), 4)]
})
print(metrics)

# ROC and AUC
fpr, tpr, thresh = roc_curve(y_test, cat_preds_prob)
roc_auc = auc(fpr, tpr)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot()
plt.title('ROC Curve')
plt.show()

# Export
metrics.to_csv("Data/output/cat_metrics.csv", index=False)
pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'thresh': thresh}).to_csv("Data/output/cat_roc_curve_data.csv", index=False)
pd.DataFrame({'AUC': [roc_auc]}).to_csv("Data/output/cat_auc.csv", index=False)
dataset3.to_csv("Data/output/cat_predictions.csv", index=False)
Training Score: 0.99984
Test Score: 0.99960
Accuracy Recall Precision F1 Score
0 0.9996 1.0 0.9992 0.9996
7. Model Comparison Summary
We compare the evaluation metrics of all models side-by-side in a summary table.
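A minimal sketch of how such a summary table can be assembled from the per-model metric CSVs exported above (the Model column is added here purely for labelling):

import pandas as pd

model_files = {
    "XGBoost": "Data/output/xgboost_metrics.csv",
    "Logistic Regression": "Data/output/logreg_metrics.csv",
    "Random Forest": "Data/output/rf_metrics.csv",
    "LightGBM": "Data/output/lgbm_metrics.csv",
    "CatBoost": "Data/output/cat_metrics.csv",
}

# Read each exported metric file and stack them into one comparison table.
summary = pd.concat(
    [pd.read_csv(path).assign(Model=name) for name, path in model_files.items()],
    ignore_index=True,
)[["Model", "Accuracy", "Recall", "Precision", "F1 Score"]]
print(summary)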
8. Hyperparameter Tuning with GridSearchCV
We perform hyperparameter tuning on XGBoost using GridSearchCV to find the best parameters.
Tune XGBoost with GridSearchCV
We search over key XGBoost hyperparameters with 3-fold cross-validation, using F1 as the scoring metric, and evaluate the best estimator on the test set.
Code
from sklearn.model_selection import GridSearchCV

# Define parameter grid for XGBoost
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 200],
    'subsample': [0.8, 1.0]
}

xgb_tune = XGBClassifier(eval_metric='logloss', random_state=42)
grid_search = GridSearchCV(estimator=xgb_tune, param_grid=param_grid, cv=3, scoring='f1', verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
best_xgb = grid_search.best_estimator_

# Evaluate tuned model
xgb_tuned_preds = best_xgb.predict(X_test)
xgb_tuned_probs = best_xgb.predict_proba(X_test)[:, 1]

tuned_metrics = pd.DataFrame({
    'Accuracy': [accuracy_score(y_test, xgb_tuned_preds)],
    'Recall': [recall_score(y_test, xgb_tuned_preds)],
    'Precision': [precision_score(y_test, xgb_tuned_preds)],
    'F1 Score': [f1_score(y_test, xgb_tuned_preds)]
})
print(tuned_metrics)
Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200, 'subsample': 0.8}
Accuracy Recall Precision F1 Score
0 0.99971 1.0 0.999421 0.999711
9. Ensemble Model (Voting Classifier)
We combine XGBoost, Random Forest, and Logistic Regression into an ensemble Voting Classifier to leverage their strengths.
Train Voting Classifier
We fit a soft-voting ensemble of XGBoost, Random Forest, and Logistic Regression and evaluate it on the test set.
Code
from sklearn.ensemble import VotingClassifier

# Define base models
xgb_base = XGBClassifier(eval_metric='logloss', random_state=42)
rf_base = RandomForestClassifier(n_estimators=100, random_state=42)
logreg_base = LogisticRegression(max_iter=1000, random_state=42)

# Create ensemble model
voting_clf = VotingClassifier(estimators=[
    ('xgb', xgb_base),
    ('rf', rf_base),
    ('lr', logreg_base)
], voting='soft')
voting_clf.fit(X_train, y_train)

ensemble_preds = voting_clf.predict(X_test)
ensemble_probs = voting_clf.predict_proba(X_test)[:, 1]

# Evaluate ensemble
ensemble_metrics = pd.DataFrame({
    'Accuracy': [accuracy_score(y_test, ensemble_preds)],
    'Recall': [recall_score(y_test, ensemble_preds)],
    'Precision': [precision_score(y_test, ensemble_preds)],
    'F1 Score': [f1_score(y_test, ensemble_preds)]
})
print(ensemble_metrics)

# Save to CSV
ensemble_metrics.to_csv("Data/output/ensemble_metrics.csv", index=False)
Accuracy Recall Precision F1 Score
0 0.99978 1.0 0.999562 0.999781
10. Stacking Ensemble
We use StackingClassifier to combine base models and a meta-model for improved performance.
Train Stacking Ensemble
We stack Random Forest, Logistic Regression, and XGBoost as base learners, with a Logistic Regression meta-learner trained on their cross-validated predictions.
Code
from sklearn.ensemble import StackingClassifier

# Define base learners and meta-learner
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('xgb', XGBClassifier(eval_metric='logloss', random_state=42))
]
stack_model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=3)
stack_model.fit(X_train, y_train)

stack_preds = stack_model.predict(X_test)
stack_probs = stack_model.predict_proba(X_test)[:, 1]

stack_metrics = pd.DataFrame({
    'Accuracy': [accuracy_score(y_test, stack_preds)],
    'Recall': [recall_score(y_test, stack_preds)],
    'Precision': [precision_score(y_test, stack_preds)],
    'F1 Score': [f1_score(y_test, stack_preds)]
})
print(stack_metrics)

stack_metrics.to_csv("Data/output/stacking_metrics.csv", index=False)
Accuracy Recall Precision F1 Score
0 0.999877 1.0 0.999754 0.999877
11. AutoML-Style Comparison Table
Aggregate metrics from all models including tuned and ensemble methods for a final comparison.
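A minimal sketch that extends the base-model summary from section 7 with the tuned and ensemble results computed above (tuned_metrics, ensemble_metrics, stack_metrics); the output file name is an assumption:

import pandas as pd

# Combine the base-model table with the tuned and ensemble results.
final_comparison = pd.concat(
    [
        summary,  # base-model table from the section 7 sketch
        tuned_metrics.assign(Model="XGBoost (tuned)"),
        ensemble_metrics.assign(Model="Voting Ensemble"),
        stack_metrics.assign(Model="Stacking Ensemble"),
    ],
    ignore_index=True,
)[["Model", "Accuracy", "Recall", "Precision", "F1 Score"]]
print(final_comparison)
final_comparison.to_csv("Data/output/model_comparison.csv", index=False)  # assumed file name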
12. Deployment-Ready Script Generator
Export the best model and required transformers for future inference (e.g. in Flask or Streamlit).
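A minimal sketch of such an export using joblib, assuming best_xgb (from the grid search) is the chosen model and scaler is the fitted StandardScaler; the artifact file names are assumptions:

import joblib

# Persist the model and the Amount scaler so a serving script can reload them.
joblib.dump(best_xgb, "Data/output/fraud_model.joblib")    # assumed file name
joblib.dump(scaler, "Data/output/amount_scaler.joblib")    # assumed: scaler is the fitted StandardScaler

# Later, e.g. inside a Flask or Streamlit app:
model = joblib.load("Data/output/fraud_model.joblib")
scaler = joblib.load("Data/output/amount_scaler.joblib")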